ABSTRACT

One of the main organizing principles in real-world social, infor- 
mation and technological networks is that of network communities, 
where sets of nodes organize into densely linked clusters. Even 
though detection of such communities is of great interest, understanding
of the structure of communities in large networks remains relatively
limited. Due to the unavailability of labeled ground-truth data, it
is practically impossible to evaluate and compare different models 
and notions of communities on a large scale. 

In this paper we identify 6 large social, collaboration, and in- 
formation networks where nodes explicitly state their community 
memberships. We define ground-truth communities by using these
explicit memberships. We then empirically study how such ground- 
truth communities emerge in networks and how they overlap. We 
observe some surprising phenomena. First, ground-truth communi- 
ties contain high-degree hub nodes that reside in community over- 
laps and link to most of the members of the community. Second, the 
overlaps of communities are more densely connected than the non- 
overlapping parts of communities, in contrast to the conventional 
wisdom that community overlaps are more sparsely connected than 
the communities themselves. 

Existing models of network communities do not capture dense 
community overlaps. We present the Community-Affiliation Graph
Model (AGM), a conceptual model of network community struc- 
ture, which reliably captures the overall structure of networks as 
well as the overlapping nature of network communities. 
Categories and Subject Descriptors: H.2.8 [Database Manage- 
ment]: Database Applications - Data mining 
General Terms: Algorithms, theory, experimentation. 
Keywords: Network communities, Affiliation networks, Social net- 
works. 

1. INTRODUCTION 

Nodes in networks organize into densely linked groups that are 
commonly referred to as network communities, clusters or modules [9, 23].
Studying networks at the level of communities is very useful, as there are
many reasons why social, information and technological networks organize
into communities. For example, society is organized into groups, families,
friendship circles, villages and associations [7, 26]. On the World Wide Web,
topically related pages may link more densely among themselves [8].

Even though extracting network communities is a fundamental 
problem, understanding of the networks at the level of network 
communities has been relatively limited due to several challenges [24, 18].
There exists a large number of different definitions and models
of network communities [9, 24], and many formalizations of community
detection lead to intractable NP-hard optimization problems [25].
Moreover, the lack of reliable ground-truth makes the
evaluation of such models extremely difficult. 

Present work. In this work we study the connectivity structure 
of ground-truth communities in order to develop models for net- 
work communities. We identified a set of 6 different large social, 
collaboration, and information networks where we can reliably de- 
fine the notion of ground-truth communities. Networks we study 
come from a number of different domains and research areas. In all 
these networks nodes explicitly state their ground-truth community 
memberships. The size of the networks ranges from hundreds of
thousands to hundreds of millions of nodes and edges. The networks
represent a wide range of edge densities, numbers of ground-truth
communities, as well as sizes and amounts of community overlap. 
The availability of reliable ground-truth communities has a pro- 
found effect. It allows us to quantify the structure of ground-truth 
communities and build better models of how nodes organize into 
communities. 

We study how ground-truth communities of nodes connect inside 
the community, how they connect to the rest of the network, and 
how they overlap. This way we can empirically study on a large 
scale how real communities map onto the underlying social network
structure, and how real communities overlap and interact. 

We show the following for a broad range of networks across di- 
verse domains: First, communities in social networks contain high- 
degree connector nodes that have an edge to most of the community 
members. Second, the overlaps between communities are more 
densely connected than the non-overlapping parts. We view the 
second of these findings as particularly surprising as it goes against 
the conventional wisdom that communities (both overlapping and 
non-overlapping) are more densely connected than their boundaries 
or the overlaps themselves [1, 22]. Thus, rather than shedding light
on the debate on whether and how much communities overlap in 
networks, our findings suggest the need to revisit standard models 
of community structure to account for the fact that nodes in the 
overlap of communities are more likely to be connected. 

What underlying process causes community overlaps to be denser 
than the communities themselves? This question motivates the sec- 
ond main contribution of this work: We present a family of prob- 
abilistic generative models for graphs that capture the observed 
phenomena and produce graphs with realistic community structure. 



We build on models of affiliation networks 0[14) and develop the 
Community-Affiliation Graph Model (AGM) which reliably reproduces
the organization of networks into communities and the overlapping
community structure. In the affiliation network, memberships
of nodes to communities are modeled with a bipartite graph,
where on the "left" we have the nodes of the social network, and on 
the "right" are the nodes representing communities. The edges of 
this bipartite graph model node-community affiliations. The cen- 
tral idea in generating social networks based on the affiliation net- 
work is that links among people stem from one or more common 
or shared community affiliations 0. 

In our model communities arise due to shared group affiliations 
[26, 7]. We model the probability of an edge between a pair of nodes
as a function of the communities that the two nodes share. Com- 
munity assignments in our model are probabilistic which allows for 
flexibility in the structure of community overlaps. We mathemat- 
ically analyze the AGM and obtain rigorous results showing that 
the model leads to community structure and overlaps as observed 
in real data. Experiments on a range of network datasets establish 
that the AGM reliably captures node community memberships, in- 
ternal structure of the groups and generates realistic group overlaps. 

Overall, our work has three main contributions: 

• Identification of networks with explicit notion of ground- 
truth communities. 

• Novel observation that community overlaps are densely con- 
nected. 

• Community-Affiliation Graph Model that explains the emergence
of dense community overlaps and accurately models
network community structure. 

Our results have implications in several contexts: 

• Design of new community detection methods: Nearly all community
detection methods assume sparse community overlaps [22, 1]. This
means that these methods cannot properly detect communities in large
networks: they would either mistakenly identify the overlap as a
separate community
or merge two overlapping communities into a single commu- 
nity. Thus, our findings have important implications for the 
development of new network community detection methods. 

• Evaluation based on ground-truth communities: Our iden- 
tification of networks with explicit ground-truth communi- 
ties allows for quantitative evaluation: based on ground-truth 
communities we can evaluate the accuracy, i.e., what fraction 
of the members of the ground-truth community a particular 
method identified. 

• Synthetic benchmarks: Our model can be used to generate 
synthetic benchmark datasets for evaluation and analysis of 
network community detection methods. 

2. RELATED WORK 

It is important to note the fundamental contrast between one of 
our main findings here — that the community overlaps are denser 
than communities themselves — and a massive body of work on 
network community detection (i.e., the unsupervised graph clustering
problem of inferring communities; see [28, 25, 9] for surveys of
this area). While community detection work seeks to infer poten- 
tial communities in a network based on density of linkage, we start 
from the other end of the problem. We start with a network in which 
the communities have already been explicitly identified and seek to 
model their structure. Thus, our work here is directly relevant for 
community detection as it identifies the properties of real communities
and presents a realistic model. A natural next step (that we do not
address here) is then to fit the model to a given graph and identify
communities.

We also note that the finding that community overlaps are denser 
than communities themselves nicely extends the notion of homophily 
in networks [19]. The 'strength of weak ties' [10] and small-world
models [29] lead to the idea that homophily in networks operates in
small pockets where inside the pocket nodes link strongly among 
themselves, and weakly to other pockets. In this respect our work 
here represents an extension to the understanding of homophily. In 
a sense we are discovering pluralistic homophily¹ where the similarity
of one node to another is the number of shared affiliations, not
just their similarity along a single dimension. This view of tie formation
is consistent with the works of Simmel [26] on the web of
affiliations, and Feld [7] on focused organization of social ties. In
both views networks are composed of overlapping tiles or social circles
that serve as organizing principles of nodes in networks. 

There has also been considerable work on probabilistic models 
for graph generation. The discovery of degree power-laws and 
other properties of static and dynamic graphs led to the develop- 
ment of random graph models that exhibited such properties [fT2l 
[HlCLZlllD. See ED [6) for surveys of this area. The main differ- 
ence here is that our goals are more ambitious as we aim to accu- 
rately model both the overall network structure as well as the node 
community memberships and the overlaps of communities. 

Our Community-Affiliation Graph Model (AGM), which produces
realistic graphs as well as community overlaps, is an example
of a bipartite affiliation network model 0[l4][3T]. Affiliation net- 
works have been extensively studied in sociology as a metaphor 
of classical social theory concerning the intersection of persons 
with groups, where it has been recognized that communities arise 
due to shared group affiliations 0|26]. In affiliation network mod- 
els, nodes of the social network are affiliated with communities 
that they belong to and the links of the underlying social network 
are derived based on the community affiliation network. The work most
closely related to our model is that of Lattanzi and Sivakumar [14],
who studied the macroscopic evolution of networks and proposed
an affiliation network model for social networks. They proved that
networks arising from the model exhibit power-law degree distributions,
the densification power law and shrinking diameter. There is a
small but crucial difference between the two models: [14] posed a
model where edge creation probability decreases with community
size. We relax this assumption and allow AGM communities to
have arbitrary edge probabilities. This way the Community-Affiliation
Graph Model acquires the necessary flexibility to accurately model
the community structure of real-world networks.

3. DATASET DESCRIPTION 

We aim to identify networks where nodes explicitly state their
community memberships. Ideally such ground-truth communities
would be real communities with shared values, mutual influence
and common purpose or function. We consider a set of 6 large social,
collaboration and information networks, where for each network
we identify a graph and a set of ground-truth communities.
Members of these ground-truth communities share properties or
attributes, common purpose or function. We did our best to identify
networks in which such ground-truth communities can be reliably
defined and identified. The networks that we study come from a
variety of domains. Their size ranges from hundreds of thousands to
hundreds of millions of nodes and billions of edges.

1 We thank Michael Macy for coining this term for us. 




Figure 1: LiveJournal group statistics: (a) Group size distribution,
(b) Number of group memberships per node. Other
networks have similar behavior (not shown for brevity) [30].

Dataset       N        E            C            S          A
LiveJournal   4.0 M    34.9 M       311,782      40.06      3.09
Friendster    117 M    2,586.1 M    1,449,666    26.72      0.33
Orkut         3.0 M    117.2 M      8,455,253    34.86      95.93
DBLP          0.4 M    1.3 M        2,547        429.79     2.57
IMDB          1.3 M    39.8 M       205          6,688.78   1.00
Amazon        0.3 M    0.9 M        49,732       99.86      14.83

Table 1: Dataset statistics. N: number of nodes, E: number of 
edges, C: number of communities, S: average community size, 
A: community memberships per node. M denotes one million. 

First we consider online social networks (the LiveJournal blogging
community [3], the Friendster online network [20], and the
Orkut social network [20]) where users create explicit groups which
other users then join. Such groups serve as organizing principles of
nodes in social networks and are focused on specific topics, interests,
hobbies, affiliations, and geographical regions. Communities
range from small to very large. For example, LiveJournal
categorizes communities into the following types: culture,
entertainment, expression, fandom, life/style, life/support, gaming,
sports, student life and technology. There are over 100
communities with 'Stanford' in their name, and they range from
communities based around different classes, student ethnic communities,
departments, activity and interest based groups, varsity
teams, etc. Overall, there are over three hundred thousand explicitly
defined communities in LiveJournal. Figure 1 gives the statistics
of group sizes and the number of group memberships of nodes in
LiveJournal. Similarly, users in Friendster as well as in Orkut define
topic-based communities that others then join. Both networks
have more than a million explicitly defined groups and each user
can join one or more such groups. We consider each group as a
ground-truth community.

The second type of network data we consider is the Amazon 
product co-purchasing network [15]. The nodes of the network rep- 
resent products and edges link commonly co-purchased products. 
Each product (i.e., node) belongs to one or more hierarchically or- 
ganized product categories and products from the same category 
define a group which we view as a ground-truth community. This 
means members of the same community share a common function 
or role, and each level of the product hierarchy defines a set of hi- 
erarchically nested and overlapping communities. 

Finally, we also consider the collaboration networks of DBLP 
and IMDB [3] where nodes represent authors/actors and edges connect
nodes that have co-authored a paper/co-appeared in a movie.
In DBLP we use publication venues as ground-truth communities
which serve as proxies for highly overlapping scientific communities
around which the network then organizes. In IMDB we found
that language is a good way of defining ground-truth communities.

Table 1 gives the dataset statistics. Observe that the size of networks
ranges from hundreds of thousands to hundreds of millions
of nodes and billions of edges. The number of ground-truth
communities varies from hundreds to millions and there is also a
nice range in group sizes and the node membership distribution.

Figure 2: (a) Edges inside the community, and (b) Maximal
Internal Community Degree Fraction as a function of the community
size.

All our networks are complete and publicly available: LiveJournal [3],
Friendster [20], Orkut [20], Amazon [15], DBLP [3] and
IMDB [3]. For each of these networks we identified a sensible
way of defining ground-truth communities that serve as organizational
units of these networks. Note that we are careful to define
ground-truth communities based on common affiliation, social circle,
role, activity, interest, function, or some other property around
which networks organize into communities [7, 10].

Even though our networks come from very different domains 
and have very different motivations for the formation of communities,
the results we present here are consistent and robust. Our work
is consistent with the premise that is implicit in all network com- 
munity literature: members of real communities share some (la- 
tent/unobserved) property or affiliation that serves as an organizing 
principle of the nodes and makes them well-connected in the net- 
work. Here we use these groups around which communities orga- 
nize to explicitly define ground-truth. And, as we will later see, the 
ground-truth communities exhibit connectivity patterns that match 
our intuition of communities as densely connected sets of nodes. 

Data preprocessing. To represent all networks in a consistent 
way we drop edge directions and consider each network as an un- 
weighted undirected static graph. Because members of the group 
may be disconnected in the network, we consider each connected 
component of the group as a separate ground-truth community. 
However, we allow ground-truth communities to be nested and to 
overlap (i.e., a node can be a member of multiple groups at once). 
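
The preprocessing described above can be sketched as follows. This is only an illustration under stated assumptions, not the code used in the study: the function names `to_undirected` and `split_into_components`, the adjacency-dictionary representation, and the choice to ignore self-loops are all hypothetical.

```python
from collections import deque

def to_undirected(directed_edges):
    """Drop edge directions and duplicates; return dict node -> set of neighbours."""
    adj = {}
    for u, v in directed_edges:
        if u == v:
            continue  # assumption of this sketch: self-loops are ignored
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def split_into_components(adj, group):
    """Split one declared group into the connected components of its induced subgraph.

    Each returned set would be treated as a separate ground-truth community.
    """
    members = set(group)
    seen = set()
    components = []
    for start in members:
        if start in seen:
            continue
        # Breadth-first search that only follows edges between group members.
        component = {start}
        seen.add(start)
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj.get(u, ()):
                if v in members and v not in seen:
                    seen.add(v)
                    component.add(v)
                    queue.append(v)
        components.append(component)
    return components
```

Because each group is processed independently, a node that belongs to several groups appears in several of the resulting communities, so nesting and overlap are preserved.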

4. STRUCTURE OF COMMUNITIES 

We now proceed to discuss our empirical findings that motivate 
the model we later develop. We focus our analyses on two aspects 
of connectivity structure of ground-truth communities: First, we investigate
the connectivity properties of communities, i.e., we study
the structure of induced graphs on the set of community members
S. Second, we study connectivity patterns of community overlaps
and investigate the amount of edge clustering in the overlap versus 
the clustering in the non-overlapping parts of the community. 

2 All networks and the corresponding ground-truth communities are
available at http://snap.stanford.edu/data

Figure 3: (a) Overlap O of communities A and B. (b) Community
Affiliation Network constructed from (a).

Connectivity of communities. We first examine the relation between
the size of the community (i.e., $|S|$) and the number of edges
between the nodes of the community ($E_S = \{(u, v) \in E : u, v \in S\}$).
Figure 2(a) shows the relation. Interestingly, across the
range of datasets we consistently observe a form of the Densification
Power Law [16] where the number of edges in the community
increases superlinearly with the community size, $|E_S| \propto |S|^{\alpha}$.
We observe that all three online social networks (Orkut, LiveJournal
and Friendster) have densification exponent $\alpha \approx 1.5$. We also
note a similar exponent for IMDB. On the other hand DBLP and
Amazon have a lower value of $\alpha \approx 1.1$. However, it is important
to correctly interpret these findings. Even though the absolute
number of one's friends that are in the community increases with
the size of the community (i.e., the number of edges increases superlinearly
with the number of nodes), the fraction of community
members whom one is friends with (i.e., density of the community)
decreases with the community size (since $\alpha < 2$). This suggests
that when the community is small, its members build relationships
among themselves, whereas in a large community, members are
less embedded into the community.
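
One way to estimate such an exponent is an ordinary least-squares fit of $\log |E_S|$ against $\log |S|$. The text does not say how the exponents were obtained, so the fitting procedure below is an assumption, and the helper names are hypothetical. Note also that the edge density $2|E_S| / (|S|(|S|-1))$ scales roughly as $|S|^{\alpha - 2}$, which is why density falls with size whenever $\alpha < 2$.

```python
import math

def internal_edge_count(adj, community):
    """Return |E_S|: the number of edges with both endpoints in the community."""
    members = set(community)
    # Every internal edge is counted once from each endpoint, hence the // 2.
    return sum(len(adj.get(u, set()) & members) for u in members) // 2

def fit_densification_exponent(sizes, edge_counts):
    """Least-squares slope of log|E_S| versus log|S| (an assumed estimator).

    Communities with a single node or no internal edges are skipped.
    """
    points = [(math.log(s), math.log(e))
              for s, e in zip(sizes, edge_counts) if s > 1 and e > 0]
    if len(points) < 2:
        raise ValueError("need at least two usable communities")
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        raise ValueError("all communities have the same size")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx
```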

The distinction between small and large communities becomes 
even clearer once we examine the existence of a connector/hub 
node in the community. To investigate this, we first define the Internal
Degree (ID) $d_{in}(u, S)$ of node u in a community S to be
the number of members of S to which u is connected,
$d_{in}(u, S) = |\{v : v \in S, (u, v) \in E_S\}|$. Then we define the Maximal
Internal Community Degree Fraction (ICDF) $f_{in}(S)$ of community S to be
the maximal fraction of community members any member node is
connected to, $f_{in}(S) = \max_{u \in S} d_{in}(u, S)/|S|$. For example, a
Maximal ICDF of 0.7 of community S means that there exists a
node $u \in S$ that is connected to 70% of all of the members of S.
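
The two definitions above translate directly into code. The sketch below assumes the graph is stored as a dictionary mapping each node to a set of neighbours; the function names are illustrative only.

```python
def internal_degree(adj, u, members):
    """d_in(u, S): number of members of S that node u is connected to."""
    return len(adj.get(u, set()) & members)

def maximal_icdf(adj, community):
    """f_in(S) = max over u in S of d_in(u, S) / |S|."""
    members = set(community)
    if not members:
        raise ValueError("community must be non-empty")
    best = max(internal_degree(adj, u, members) for u in members)
    return best / len(members)
```

Since a node is not connected to itself, the value is at most $(|S| - 1)/|S|$ under this definition.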

Figure 2(b) plots the average Maximal ICDF as a function of
community size. We observe that in communities smaller than
about 100 nodes, there exist connector nodes that link to more than 80%
of all the community members. However, as the community size increases
beyond 100, the Maximal ICDF tends to quickly decrease.
This is interesting as it suggests that smaller communities tend to
organize themselves around connector node(s) and thus share common
bonds. There are two exceptions to this, both of which can be
nicely explained. Amazon is a product co-purchasing network and 
thus there is no internal reason why such connector nodes should 
exist. And the presence of a connector node in the DBLP network 
would suggest existence of special publication venues where the 
connector node would co-author all the papers, which again is not 
realistic. 

Edge probability as a function of shared communities. Commu- 
nities in networks form overlaps when there exist nodes that belong 
to multiple communities. Figure 3(a) illustrates the setting of two
communities A and B, and nodes that belong to both communities
reside in the overlap O. We study the structure of group overlap
by simply asking what is the probability that a pair of nodes is connected
if they share k common community memberships, i.e., the
nodes belong to the overlap of k communities. Figure 4 plots this
probability (the red 'Data' curve) for all six datasets. The figure
also plots (the green 'AGM' curve) the same quantity as modeled
by our Community-Affiliation Graph Model that we will describe
in Section 6.
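
The empirical quantity described above can be computed along the following lines. This is a small-scale sketch with hypothetical names: it enumerates every within-community pair, which is quadratic in community size and would need sampling or a smarter index on networks of the size studied here. Pairs that share no community are not enumerated, so the background probability has to be estimated separately.

```python
from collections import defaultdict
from itertools import combinations

def edge_probability_by_shared(adj, memberships):
    """Empirical P(edge | pair shares k communities) for each observed k >= 1.

    adj         -- dict: node -> set of neighbours (undirected graph)
    memberships -- dict: community id -> iterable of member nodes
    """
    shared = defaultdict(int)  # (u, v) with u < v -> number of shared communities
    for members in memberships.values():
        for u, v in combinations(sorted(set(members)), 2):
            shared[(u, v)] += 1

    pairs = defaultdict(int)      # k -> number of pairs sharing k communities
    connected = defaultdict(int)  # k -> how many of those pairs are linked
    for (u, v), k in shared.items():
        pairs[k] += 1
        if v in adj.get(u, ()):
            connected[k] += 1
    return {k: connected[k] / pairs[k] for k in pairs}
```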



First, notice that all curves are generally increasing. This means 
that, the more communities a pair of nodes has in common, the 
higher the probability of an edge. In LiveJournal, for example, if 
a pair of nodes has 8 groups in common, the probability of friend- 
ship is nearly 80%. To appreciate how strong the effect of shared 
communities is on the edge probability, one has to note that all 
our networks are extremely sparse. The background probability 
of a random pair of nodes being connected is $\approx 10^{-5}$, while as
soon as a pair of nodes shares two communities, their probability
of linking increases by 4 orders of magnitude (from $10^{-5}$ to
$10^{-1}$). We note that all other data sets have similar shapes: the
probability of a pair of nodes being connected approaches 1 as the 
number of common communities increases. While in online social 
networks the edge probability exhibits a diminishing-returns-like 
growth, in other datasets (IMDB, DBLP, Amazon) it appears to 
follow a threshold-like behavior. 

In retrospect the above result is very intuitive. For example,
in the context of social networks two students that belong to both
a Tuesday salsa club and a Sunday movie club are more likely to
meet each other than if they belonged to only a single club.
Thus, the more communities nodes share, the more likely they are 
to meet and interact. Communities thus serve as organizing princi- 
ples of nodes in social networks and are created on shared affilia- 
tion, role, activity, social circle, interest or function. 

Connector resides in the overlap. The next question we investi- 
gate is whether the connector node (the node that is connected to 
the most other members of a community) belongs to the overlap. 
We extract all pairs of communities (A, B) that have non-empty 
overlap O, and then compute the probability that the connector of 
community A is in the overlap O. Figure 5 shows this probability
as a function of the fraction of nodes in the community overlap
($|O|/|A|$). If the connector node were a random node in a
community, then the probability of a connector node belonging to
the overlap would equal the fraction of the nodes in the overlap, i.e.,
the probability of the connector being in O would be $|O|/|A|$. However, we
find that the probability of the connector being in the overlap increases
superlinearly with the size of the overlap in all the data sets. This
demonstrates that connector nodes tend to reside in group overlaps 
and are not central to a single community. 
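
The measurement for a single pair of communities might look like the sketch below. The names are hypothetical, and breaking ties by the smallest node identifier is an assumption, since the text does not say how ties between equally connected members are resolved. Averaging the boolean result over community pairs binned by overlap fraction gives a curve that can be compared against the random baseline $|O|/|A|$.

```python
def connector(adj, community):
    """Member with the largest internal degree (ties: smallest node id, by assumption)."""
    members = set(community)
    # max() keeps the first maximum it sees, so iterating in sorted order
    # resolves ties in favour of the smallest identifier.
    return max(sorted(members), key=lambda u: len(adj.get(u, set()) & members))

def connector_in_overlap(adj, comm_a, comm_b):
    """Return (|O| / |A|, whether the connector of A lies in the overlap O)."""
    a, b = set(comm_a), set(comm_b)
    overlap = a & b
    if not overlap:
        raise ValueError("communities do not overlap")
    return len(overlap) / len(a), connector(adj, a) in overlap
```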

Based on all these results, we believe that the dense community 
overlaps in our study reflect a fundamental property of the underly- 
ing networks. Understanding the possible causes for this property 
will be the subject of the next section. 

5. IMPLICATIONS OF DENSE COMMUNITY 
OVERLAPS 

Even though our findings above are intuitively natural, we note a 
sharp contrast between the current understanding of network communities
and our findings. The present view of network communities is
based on two fundamental social network theories: triadic closure
[29] and 'strength of weak ties' [10]. Building on these two theories
leads to a picture of network communities as illustrated in Figure
6(a). Nodes inside communities link densely to one another while
there are relatively few edges between the groups. Early works on
network community detection, e.g., Newman's betweenness centrality [9],
Modularity optimization [9] as well as graph partitioning
methods all adopt this view of network communities. This view
of communities has another important consequence. It suggests
that homophily in networks operates in small pockets where nodes
gather in dense non-overlapping clusters (Figure 6(a)).

In networks communities also tend to overlap as nodes can belong
to multiple communities at once (and thus reside in the overlap) [22].
Applying the conventional view of network communities
in this case leads to the (unnatural) structure of community overlaps
as illustrated in Figure 6(b). Present models of overlapping
communities [1, 13, 22] assume that community overlaps are less
densely connected than the groups themselves. This means that
they assume that the probability of an edge between a pair of nodes
decreases with the number of shared community memberships. As
a consequence, present methods cannot properly detect
communities in large networks: they would either mistakenly
identify the overlap as a separate community or merge two overlapping
communities into a single community.

Figure 4: Edge probability between two nodes given the number of common communities that they belong to.

Our finding that, the more community affiliations a pair of nodes 
shares, the more likely they are connected, suggests community 
overlaps as illustrated in Figure 6(c). This view of network formation
differs from what has been assumed in the past and is consistent
with early works in social network analysis. In particular,
works of Simmel on the web of affiliations [26], and Feld on the
focused organization of social ties [7] view networks as being composed
of overlapping tiles or social circles that serve as organizing
principles of nodes in networks. Our work suggests exactly the
same analogy: network communities can be thought of as overlapping
tiles, and areas of the network where more tiles overlap naturally
contain more connections.

With respect to homophily, our work extends its notion. In a 
sense we are discovering pluralistic homophily where the simi- 
larity of one node to another is the number of shared affiliations, 
not just their similarity along a single dimension. This means that 
homophily does not operate in concentrated pockets but rather as 
overlapping tiles. Last, the fact that regions of the network where 
communities overlap are more densely connected also nicely ex- 
plains the co-existence of communities and the global core-periphery 
structure that has been observed in many networks [HTH . 

6. COMMUNITY-AFFILIATION
GRAPH MODEL

Figure 5: Probability of a connector node belonging to the
community overlap as a function of the fraction of community
members in the overlap.

In the following, we would like to find some simple, conceptual 
model of behavior, which could naturally lead to the phenomena 
that we have observed. Building on Breiger's foundational work 
where it has been recognized that communities and "cliques" 
arise due to shared group affiliations [26, 7], we present the
Community-Affiliation Graph Model (AGM), a family of simple
probabilistic generative models for graphs that capture the observed 
phenomena and reliably reproduce the organization of networks 
into communities and the overlapping community structure. 

Community-Affiliation Graph Model. We build our model on
the following intuition. Consider a pair of people that are members
of several different interest based communities. Then, by having
more interests in common, they are more likely to meet and link.
Our model is thus based on two main ingredients. The first ingredient
is a bipartite affiliation network that links nodes of the social network
to communities that they belong to. The second ingredient
is the insight that each community also carries a single parameter
that captures the probability that nodes belonging to that community
share a link. Thus, naturally, the more communities a pair
of nodes shares, the higher is the probability of linking. Figure 3(b)
illustrates the essence of our model. We start with a bipartite graph
where the nodes at the bottom represent the nodes of the social network
and the nodes on the top represent communities. The edges
between nodes of the social network and the communities indicate
community memberships. We denote the bipartite affiliation network
as B(V, C, M), where V denotes the set of nodes of the original
social network G, C is a set of communities, and there is an
edge $(u, c) \in M$ from node $u \in V$ to community $c \in C$ if node u
belongs to community c.

Figure 6: Conventional view of (a) two non-overlapping, and
(b) overlapping communities. Top: network, bottom: corresponding
adjacency matrix. (c, d) Community overlaps as suggested
by our findings.

Now, given the affiliation network B(V, C, M), we want to generate
a social network graph G(V, E). To achieve this we need to
specify the process which generates the edges E of G given the affiliation
network B. We consider a simple parameterization where
we assign a parameter $p_c$ to every community $c \in C$. Parameter
$p_c$ models the probability of an edge between two members of
community c. In other words, we simply generate an edge between
a pair of nodes that belong to community c with probability $p_c$.
Each community c creates edges independently. However, if the
two nodes have already been connected via some other common
community membership, then the duplicate edge is not included in
the graph G(V, E).

Definition 1. Let B(V, C, M) be a bipartite graph where V is a set
of nodes, C is a set of communities, and an edge (u, c) ∈ M connects
node u ∈ V to community c ∈ C if u belongs to community c. Also, let
{p_c} be a set of probabilities for all c ∈ C. Given affiliation
network B(V, C, M) and {p_c}, the Community-Affiliation Graph Model
generates a graph G(V, E) with the node set V and the edge set E as
follows. For each pair of nodes u, v ∈ V, the AGM creates edge
(u, v) ∈ E with probability p(u, v):

$$p(u, v) = 1 - \prod_{k \in C_{uv}} (1 - p_k), \qquad (1)$$

where C_uv ⊆ C is the set of communities that u and v share
(C_uv = {c | (u, c), (v, c) ∈ M}).
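The generative process of Definition 1 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `edge_probability` and `generate_agm`, the dictionary-based representation of the affiliation network, and the exhaustive loop over all node pairs are assumptions made for readability (the loop is quadratic in the number of nodes and only suitable for small graphs).

```python
import itertools
import random


def edge_probability(u, v, memberships, p):
    """Return p(u, v) = 1 - prod_{k in C_uv} (1 - p_k), as in Eq. (1).

    memberships: dict mapping node -> set of community ids (hypothetical layout)
    p: dict mapping community id -> edge probability p_c
    """
    shared = memberships[u] & memberships[v]  # C_uv
    no_edge = 1.0
    for c in shared:
        no_edge *= 1.0 - p[c]
    return 1.0 - no_edge


def generate_agm(memberships, p, rng=None):
    """Sample an undirected edge set; each pair is linked with probability p(u, v)."""
    rng = rng or random.Random()
    edges = set()
    for u, v in itertools.combinations(sorted(memberships), 2):
        if rng.random() < edge_probability(u, v, memberships, p):
            edges.add((u, v))
    return edges
```

For instance, with `memberships = {0: {"A"}, 1: {"A", "B"}, 2: {"A", "B"}, 3: {"B"}}`, nodes 1 and 2 share both communities and therefore link with a higher probability than nodes 0 and 1, while nodes 0 and 3 share no community and are never linked in this sketch.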



Note that this simple process already ensures that pairs of nodes
that belong to multiple common communities are more likely to link.
This is because nodes that share multiple community memberships get
multiple chances to create a link. For example, pairs of nodes in
the overlap of communities A and B in Figure 3(b) get two chances to
create an edge. First they can get connected with probability p_A
(due to their membership in community A) and then also with
probability p_B (due to membership in B). While pairs of nodes
residing in the non-overlapping region of A link with probability
p_A, nodes in the overlap link with probability
1 - (1 - p_A)(1 - p_B) = p_A + p_B - p_A p_B > p_A (whenever
p_A < 1 and p_B > 0), which already ensures that overlaps of
communities are more densely connected than the non-overlapping
parts.

Last, we also point out the flexible nature of the Community-
Affiliation Graph Model, which allows for modeling a wide range
of network community structures. The flexibility of the affiliation
network structure allows for modeling non-overlapping, hierarchically
nested, as well as overlapping communities (Figure 6).

Properties of the AGM. The elegant nature of our model allows
for mathematical analysis. Next, we derive several properties of the
AGM networks that match the observations from Section 4. The
aim of the analysis is to provide simple analytical results that
illustrate that the AGM naturally obeys the empirical observations.

Observation 1. The expected number of edges |E_c| between the
nodes of community c increases super-linearly as a function of the
number of nodes n_c that belong to c, if p_c is set to be
proportional to n_c^{-β} with 1 > β > 0. (Observation in
Figure 2(a).)

Proof Sketch. |E_c| = n_c (n_c - 1)/2 · p_c ∝ n_c^2 p_c. As
p_c ∝ n_c^{-β} in the simplified AGM, we have |E_c| ∝ n_c^{2-β}. As
β < 1, |E_c| grows super-linearly.
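A quick numerical check of Observation 1 is sketched below. The function name `expected_internal_edges` and the proportionality constant `scale` are illustrative assumptions; the paper only states that p_c is proportional to n_c^{-β}.

```python
def expected_internal_edges(n_c, beta, scale=1.0):
    """Expected |E_c| = n_c (n_c - 1) / 2 * p_c with p_c = scale * n_c ** (-beta).

    `scale` is a hypothetical proportionality constant; p_c is capped at 1.
    """
    p_c = min(1.0, scale * n_c ** (-beta))
    return n_c * (n_c - 1) / 2.0 * p_c


# Edges per member grow with community size, i.e. |E_c| grows super-linearly.
for n_c in (10, 100, 1000):
    print(n_c, expected_internal_edges(n_c, beta=0.5) / n_c)
```

With β = 0.5, the number of edges per member keeps increasing as the community grows, which is what super-linear growth of |E_c| in n_c means.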

Observation 2. The fraction of connected neighbors in the
overlap of two communities is higher than in the non-overlapping
part of a single community.

Proof Sketch. The fact naturally follows from Definition 1.

Observation 3. Given communities A, B and their overlap
O, the probability that the connector node (the node that is
connected to most other members of a community) of A is in overlap
O is higher than |O|/|A|. (Observation in Figure 5.)

Proof Sketch. The AGM generates edges among the nodes in community A
(node set of size n_A) independently with probability p_A
(G_{n_A, p_A}), and edges among the nodes in B independently with
probability p_B (G_{n_B, p_B}). Let X_o be the event that a
particular node o ∈ O is the connector of A. The probability of the
connector node being in O is the sum of p(X_o) over all o ∈ O,
Σ_{o ∈ O} p(X_o), which is |O| · p(X_o) as p(X_o) is the same for
any o ∈ O. Now, it suffices to show that p(X_o) > 1/|A|. X_o is
equivalent to the event that the internal degree of o in A,
d_in(o, A), is the maximum among d_in(u, A) of all u ∈ A. Node o is
connected to other nodes in A by both G_{n_A, p_A} and G_{n_B, p_B},
as G_{n_B, p_B} connects o to other nodes in O ⊆ A. Only if
G_{n_B, p_B} does not make any connection between o and other
members of O does d_in(o, A) have the same distribution as
d_in(u, A) of any other u ∈ A, and thus p(X_o) = 1/|A|. However,
G_{n_B, p_B} connects o to other members of O with positive
probability, and thus p(X_o) is strictly higher than 1/|A|.
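Observation 3 can also be checked empirically with a small Monte Carlo simulation. The sketch below is an illustration under stated assumptions, not part of the paper: the function name, the default sizes (|A| = 10, |O| = 3), the probabilities, and the uniform tie-breaking rule for choosing the connector are all hypothetical choices.

```python
import random


def connector_in_overlap_rate(n_a=10, n_o=3, p_a=0.5, p_b=0.5, trials=5000, seed=0):
    """Estimate the probability that the connector of A lies in the overlap O.

    Nodes 0..n_o-1 form the overlap O and nodes 0..n_a-1 form community A.
    Only edges inside A affect A's internal degrees, so the rest of B is
    not simulated.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        degree = [0] * n_a
        for u in range(n_a):
            for v in range(u + 1, n_a):
                # Since u < v, v < n_o means both endpoints are in O; such
                # pairs get a second chance to link through community B.
                both_in_overlap = v < n_o
                p = 1 - (1 - p_a) * (1 - p_b) if both_in_overlap else p_a
                if rng.random() < p:
                    degree[u] += 1
                    degree[v] += 1
        top = max(degree)
        # Break ties uniformly at random so node order does not bias the result.
        connector = rng.choice([u for u in range(n_a) if degree[u] == top])
        hits += connector < n_o
    return hits / trials
```

Calling `connector_in_overlap_rate()` gives an estimate that can be compared against the baseline |O|/|A| = 0.3, while `connector_in_overlap_rate(p_b=0.0)` removes the extra linking chance and should return a value close to that baseline.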

Figure 7: AGM allows for rich modeling of network communities:
(a) non-overlapping, (b) nested, (c) overlapping. In (a) we can
assume that nodes in disjoint communities connect with small
prob. ε, which allows for sparse links between communities A and B.

Observation 4. If p_c = p for all communities c, the conditional
edge probability between two nodes is an increasing function
of the number of communities that both nodes belong to.
(Observation in Figure 4.)

Proof Sketch. When two nodes belong to k common communities,
the AGM connects the two nodes with probability 1 - (1 - p)^k,
which is an increasing function of k.
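The expression in the proof sketch of Observation 4 is easy to tabulate. The snippet below is a small illustration; the function name and the value p = 0.1 are arbitrary assumptions, not values from the paper.

```python
def shared_community_edge_probability(p, k):
    """Edge probability for two nodes sharing k communities, each with probability p."""
    return 1.0 - (1.0 - p) ** k


# Probabilities for k = 0..5 shared communities with an illustrative p = 0.1.
probs = [shared_community_edge_probability(0.1, k) for k in range(6)]
# Gain from each additional shared community.
gains = [b - a for a, b in zip(probs, probs[1:])]
```

With a single shared value of p, the probability rises with k while each additional shared community adds a little less than the previous one, since the gain from the (k+1)-th community is p(1 - p)^k.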



7. MODEL EVALUATION 

Having defined the AGM, we now proceed to investigate its prop- 
erties. We perform a set of simulation experiments and compare 
our model to other models of network community structure. For 
each network, we generate a synthetic network with the AGM, and 
then compare the synthetic network and the synthetic community 
structure to the structure of the real network. For comparison, we
use the model of network community structure proposed by
Lancichinetti et al. [13]. We refer to the model as the LFR. The LFR
model is the state-of-the-art model for generating networks with 
overlapping community structure that can then be used for eval- 
uating community detection methods. Our goal is to understand 
whether the AGM qualitatively reproduces structural properties of 
real networks, real communities and real community overlaps. 

Experimental setup. In order to compare real networks to the syn- 
thetic networks generated by the AGM and LFR, we need to set pa- 
rameters of both models. Both the AGM and the LFR require bipar- 
tite affiliation networks, and for simplicity we construct the com- 
munity affiliation network from the node community membership 
information. Then, we use maximum likelihood estimation to fit 
the parameters of the LFR as well as the AGM. For the AGM, we fit a
set of probabilities {p_c}. (We discuss the fitting of {p_c} via
convex optimization in the next subsection.) The LFR requires the
following parameters: the power-law coefficient of the network degree
distribution and the fraction of external edges of each node. We 
estimate these parameters from the real network using maximum 
likelihood estimation. Note that when computing the community 
internal degree of a node, LFR penalizes nodes in the overlap: the 
internal degree is inversely proportional to the number of commu- 
nities that the node belongs to. Therefore, LFR also assumes sparse 
community overlaps. 

We then compare the networks synthesized by the two models to
the ground-truth networks. We investigate three criteria: structure
of communities, structure of community overlaps, and structure of
the networks themselves. For each of the six datasets we repeat the
measurements from Section 4 on real and on synthetic networks
and examine the performance of the two models. For brevity we
focus on LiveJournal and refer the reader to the extended version
of the paper [30] for results on other datasets.

Estimating p_c via convex optimization. Given a graph G(V, E)
and a bipartite community affiliation network B(V, C, M), we aim
to find parameters {p_c} that maximize the likelihood of observed
edges in G:

$$l(\{p_c\}) = \prod_{(u,v) \in E} p(u, v) \prod_{(u,v) \notin E} \bigl(1 - p(u, v)\bigr). \qquad (2)$$

By applying Eq. 1 we transform the optimization problem to:

$$\arg\max_{\{p_c\}} \prod_{(u,v) \in E} \Bigl(1 - \prod_{k \in C_{uv}} (1 - p_k)\Bigr) \prod_{(u,v) \notin E} \Bigl(\prod_{k \in C_{uv}} (1 - p_k)\Bigr)$$

with constraints 0 ≤ p_c ≤ 1.

Figure 8: LiveJournal community properties. (a) Edges inside the
group. (b) Maximal ICDF.

This optimization is nontrivial to solve. The objective function
is non-convex as it involves a product over the variables p_k. Now,
we show that it can be converted to a convex optimization problem.

We maximize the logarithm of the likelihood and perform a change
of variables 1 - p_k = e^{-x_k}:

$$\arg\max_{\{x_c\}} \sum_{(u,v) \in E} \log\Bigl(1 - e^{-\sum_{k \in C_{uv}} x_k}\Bigr) - \sum_{(u,v) \notin E} \sum_{k \in C_{uv}} x_k$$

and the constraints 0 ≤ p_c ≤ 1 become x_c ≥ 0. The transformed
problem is a convex optimization problem in {x_c} (the objective
being maximized is concave), and thus the globally optimal values
of {x_c} can be found efficiently. Then, by the change of
variables, we find the values of {p_c}.
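The transformed problem can be solved with any convex solver. As a minimal sketch, and not the solver used in the paper, the code below runs projected gradient ascent on the concave log-likelihood in {x_c}. The function name `fit_agm_probabilities`, the step size, the iteration count, the step cap, and the choice to skip node pairs that share no community are all assumptions of this illustration.

```python
import itertools
import math


def fit_agm_probabilities(memberships, edges, steps=3000, lr=0.1):
    """Fit {p_c} by projected gradient ascent on the log-likelihood in x_c.

    memberships: dict node -> set of community ids (hypothetical layout)
    edges: iterable of undirected (u, v) pairs
    Pairs sharing no community are skipped (assumption of this sketch).
    """
    edge_set = {frozenset(e) for e in edges}
    pairs = []  # (shared communities, is_edge) for pairs sharing >= 1 community
    for u, v in itertools.combinations(sorted(memberships), 2):
        shared = memberships[u] & memberships[v]
        if shared:
            pairs.append((shared, frozenset((u, v)) in edge_set))

    communities = set().union(*memberships.values())
    x = {c: 1.0 for c in communities}  # x_c = -log(1 - p_c)
    for _ in range(steps):
        grad = {c: 0.0 for c in communities}
        for shared, is_edge in pairs:
            if is_edge:
                s = sum(x[c] for c in shared)
                g = 1.0 / math.expm1(s)  # d/dx_k of log(1 - exp(-s))
            else:
                g = -1.0  # d/dx_k of -sum_k x_k
            for c in shared:
                grad[c] += g
        for c in communities:
            step = lr * grad[c] / max(len(pairs), 1)
            step = max(-0.5, min(0.5, step))  # cap the step for stability
            x[c] = max(x[c] + step, 1e-9)  # keep x_c positive (avoids log(0))
    return {c: 1.0 - math.exp(-x[c]) for c in communities}
```

As a sanity check, for a community whose members share no other community, the fitted p_c should approach the fraction of member pairs that are actually linked, which is the closed-form maximum-likelihood estimate in that special case.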

Evaluation: Properties of communities. First, we compare the 
connectivity patterns of communities in the synthetic networks to 
the ground-truth LiveJournal network. We perform measurements
analogous to those in Figure 2 and plot the results in Figure 8.
We overlay the original results from LiveJournal (Figure 2) with
the properties of communities in the synthetic AGM and LFR
networks. We observe that the AGM captures the connectivity
patterns of communities much better. For example, communities in
the AGM tend to have connector nodes when the community size is
smaller than 100 nodes (Figure 8(b)). Similarly, the AGM also
captures the community densification power law nearly perfectly
(Figure 8(a)).

Evaluation: Community overlaps. In Section 4 we observed that
nodes in the overlap (multi-membership nodes) have higher
connectivity than the nodes not in the overlap (single-membership
nodes). We also noted that an overlap is likely to contain the
connector node of the group, and the overlap becomes denser as more
groups join the overlap. We examine how well the two models mimic
these patterns of group overlaps.

Figure 9: LiveJournal community overlaps. (a) Edge probability
(x-axis: common memberships). (b) Connector inside the overlap
(x-axis: fraction of the overlap).

First, we found that the edge probability between a pair of nodes
is an increasing function of the number of communities that the
nodes share (Figure 4). Figure 9(a) plots the edge probability as
a function of the number of common communities between a pair
of nodes for the LiveJournal network and compares it to the two
models. Notice that the AGM successfully reproduces the edge
probability, while LFR fails to model the fact that nodes that share
more communities are more likely to be connected. In fact,
Figure 4 shows the edge probability for all six datasets (red lines)
and the same probability as modeled by the AGM (green lines).
Notice that the AGM is able to capture a wide range of behaviors,
from diminishing returns (Fig. 4(a)) and S-shape (Fig. 4(d)) to a
slowly rising pattern (Fig. 4(f)).

Second, we also observed that real communities have a connector
node and the connector node is more likely to exist inside the
overlap (Figure 5). We validate the presence of the connector node
in synthetic communities in Figure 9(b). We compute the probability
that the overlap O between two communities A and B contains a
connector node of either community A or B. Notice a very close
fit to the LiveJournal network. In contrast, in LFR, the probability
of a connector being in the overlap is much lower, which confirms
that the overlaps between LFR communities are less dense than a
single community.

This is interesting as it explains why communities in the AGM
tend to have connector nodes (Figure 8(b)). Since edges inside each
AGM community are created independently (which by itself does
not produce skewed node degrees), we would naturally expect that
all the nodes would have similar degrees and no connector nodes
would emerge. However, since the overlaps are dense, the nodes
in the overlap tend to have higher degrees and emerge as connector
nodes (Fig. 8(b)). LFR explicitly forces a heavy-tailed degree
distribution, which makes a few nodes have very high degrees.
However, LFR communities do not have a connector node. This is
because LFR prevents the nodes in the overlaps from forming edges
in a single community and thus the connector node links to a small
fraction of community members.

We now briefly mention the performance of the AGM on all
other datasets. For each structural property, we measure the quality
of fit between that property in the synthetic AGM and LFR
networks and the real data. We apply the Kolmogorov-Smirnov (KS)
statistic, which is a non-parametric way of quantifying the distance
between two distribution functions. Given two distribution
functions f(x), g(x), i.e., plots of the same structural property,
the KS-statistic computes the maximum difference between the
cumulative area under the two curves,

$$KS(f, g) = \sup_x \left| \int_{t \le x} f(t)\,dt - \int_{t \le x} g(t)\,dt \right|.$$

We compute the KS-statistic between the AGM curve (or the LFR
curve) and the true curve for all the following properties:

• VOL: Edges inside the community.

• MID: Maximum Internal Community Degree Fraction.

• EP: Edge probability between nodes.

• PC: Probability of a connector residing in the overlap.

• OO: Fraction of connected neighbors in the overlap.

• AABB: Fraction of connected neighbors in a community.

Dataset        VOL     MID     EP      PC      OO      AABB    Avg.
LiveJournal    0.98    0.72    0.53    0.73    0.86    0.77    0.76
Friendster     1.00    0.55    0.70    0.84    0.63    0.49    0.70
Orkut          0.99    0.67    0.57    0.95    0.50    0.21    0.65
DBLP           0.99    0.05    0.88    0.66    0.66    0.56    0.63
IMDB           0.93    0.36    0.17    0.39    0.16    0.21    0.37
Amazon         0.91    0.73    0.35   -0.79    0.80    0.76    0.46
Average        0.97    0.51    0.53    0.46    0.60    0.50    0.60

Table 2: Community connectivity and overlaps: Relative
improvement in the KS-statistic of the AGM over LFR.

Table 2 reports the relative improvement in the KS-statistic
between the two models (the difference between the two models
normalized by the larger value of the two). The value of the relative
improvement can be between 1 (the AGM completely outperforms
LFR) and -1 (LFR completely outperforms the AGM). On average, the
AGM shows a relative improvement of 60%, which means that the AGM
outperforms LFR by more than a factor of 2. Furthermore, the AGM
shows significantly lower average KS-statistics in every property as
well as on every network. Overall, we conclude that the AGM reliably
captures the properties of real overlaps and real groups and
significantly improves over previous state-of-the-art models.
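The two quantities used in this comparison can be sketched as follows. This is an illustration under stated assumptions rather than the evaluation code of the paper: both curves are assumed to be sampled on the same evenly spaced grid (so the cumulative area reduces to a running sum), and the function names are hypothetical.

```python
from itertools import accumulate


def ks_statistic(f, g):
    """Maximum absolute gap between the running sums of two sampled curves."""
    if len(f) != len(g):
        raise ValueError("curves must be sampled at the same points")
    return max(abs(a - b) for a, b in zip(accumulate(f), accumulate(g)))


def relative_improvement(ks_agm, ks_lfr):
    """(KS_LFR - KS_AGM) / max(KS_LFR, KS_AGM).

    Lies in [-1, 1] for non-negative inputs; positive when the AGM fits better.
    """
    larger = max(ks_agm, ks_lfr)
    return 0.0 if larger == 0 else (ks_lfr - ks_agm) / larger
```

Under this definition, a relative improvement of 0.6 corresponds to the AGM's KS-statistic being 40% of LFR's, for example `relative_improvement(0.4, 1.0)`.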

Evaluation: Network properties. Last, we also study whether the 
AGM is able to generate overall realistic networks. We examine 
how well the global structural properties of the synthetic networks 
match the properties of the ground-truth network. For each of the
networks synthesized by the AGM and LFR, we quantify the degree 
of agreement between the real and synthetic network by computing 
the KS-statistic on the following network properties: 

• Degree distribution (Deg): histogram of the number of edges
of a node. Networks tend to have power-law degrees.

• Clustering coefficient (CCF): distribution of the clustering
coefficient of nodes [29].

• Hop plot (Hop): the number of pairs of nodes reachable in
less than x hops [16].

• Triad participation (TP): the number of triangles that a node
participates in [27].

• Eigenvalues (EigVal): distribution of eigenvalues of the
adjacency matrix [6].

• Eigenvector (EigVec): distribution of components in the
eigenvector associated with the largest eigenvalue [6].
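Three of the node-level quantities in the list above (degree, triad participation, and clustering coefficient) can be computed directly from an adjacency structure. The sketch below is illustrative only; the function name and the dict-of-sets graph representation are assumptions, and the pairwise neighbor check is meant for small graphs.

```python
from itertools import combinations


def node_statistics(adj):
    """Per-node (degree, triangle count, clustering coefficient).

    adj: dict node -> set of neighbours (undirected graph, no self-loops).
    """
    stats = {}
    for u, nbrs in adj.items():
        k = len(nbrs)
        # A triangle through u is a pair of u's neighbours that are linked.
        triangles = sum(1 for v, w in combinations(nbrs, 2) if w in adj[v])
        possible = k * (k - 1) / 2
        clustering = triangles / possible if possible else 0.0
        stats[u] = (k, triangles, clustering)
    return stats
```

The distributions of these per-node values over a real and a synthetic network are the curves that would then be compared with the KS-statistic.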

Table 3 shows the relative improvements in KS-statistics of the
AGM over LFR. The AGM network follows very closely the pat- 
terns of the ground-truth network for most properties, like the de- 
gree distribution, triad participation, and eigenvalues. The only ex- 
ception where LFR outperforms the AGM is the Eigenvector. The 
AGM exhibits 9% better fit (KS-statistic) than LFR for Degree dis- 
tribution, 221% for Clustering coefficient, 7% for Hop distribution, 
120% for Triad participation, and 122% for Eigenvalues. For the 
Eigenvector property, the LFR exhibits a 17% better value. Overall, 
these results demonstrate that the AGM is not only able to reliably 
capture the structure of network communities and community over- 
laps but that it also accurately generates the underlying networks. 
Dataset        Deg     CCF     Hop     TP      EigVal  EigVec
…  0.04  …
Orkut  …  0.08  3.39  1.12  …
DBLP          -1.10    2.31    0.02    1.07    0.59   -0.74
IMDB          -0.03    …       0.20    0.69    1.87   -0.51
Amazon        -0.81    2.48    0.12   -1.29   -0.53   -0.24
Average KS     0.09    2.21    0.07    1.20    1.22   -0.17

Table 3: Relative difference in the KS-statistic of AGM and
LFR for network properties. Positive values mean that AGM
outperforms LFR.



8. CONCLUSION 

In this paper we identified a set of networks with explicitly
defined ground-truth communities. This allowed us to investigate the
structure and overlaps of ground-truth communities in networks.
We observed that the overlaps of communities are more densely
connected than the non-overlapping parts of communities, which is
in contrast to assumptions made by present community detection
models and methods. We also observed that ground-truth
communities contain high-degree hub nodes that reside in community
overlaps and link to most of the members of the community. We
then presented the Community-Affiliation Graph Model (AGM), a
conceptual model of network community structure, which reliably
captures the overall structure of networks as well as the overlapping
nature of network communities.

Our results have relevance in multiple settings. First, our
analysis sheds light on the organization of complex networks and
provides new directions for research on community detection. Second,
ground-truth communities offer a reliable basis for evaluating
community detection, which was not possible before. Last, the AGM
provides a realistic benchmark network on which new community
detection algorithms can be developed and evaluated.

A natural step for future work is to build on these findings and
design community detection methods that can detect dense overlaps.
Explicitly maximizing the likelihood defined in Eq. 2 over both the
affiliation graph B and the parameters p_c would be a good
step in this direction.